Introduction to Core Spark Concepts

RDD

A resilient distributed dataset (RDD) is Spark’s core abstraction for working with data. An RDD in Spark is simply an immutable, distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

RDDs can contain elements of any type: strings, lines of text, rows, user-defined objects, and nested collections.

You can also persist, or cache, RDDs in memory or on disk. Spark RDDs are fault-tolerant: if a given node or task fails, the RDD can be reconstructed automatically on the remaining nodes and the job will complete.
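To make these ideas concrete, here is a minimal sketch of an RDD workflow. The input path, application name, and the local[*] master URL are illustrative placeholders; on a cluster the master URL would point at your cluster manager.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDBasics {
  def main(args: Array[String]): Unit = {
    // Application name and local master are placeholders for illustration.
    val conf = new SparkConf().setAppName("RDDBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD from a text file; Spark splits it into partitions.
    val lines = sc.textFile("input.txt")

    // Transformations derive new RDDs lazily; nothing runs yet.
    val errors = lines.filter(line => line.contains("ERROR"))

    // Cache the derived RDD in memory so repeated actions can reuse it.
    errors.cache()

    // Actions trigger the distributed computation and return a result.
    println(s"Number of error lines: ${errors.count()}")

    sc.stop()
  }
}
```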

Spark Architecture

Spark uses a master/slave architecture. At a high level, every Spark application consists of two kinds of processes: a driver program that runs on the driver machine, and multiple executor processes, each running on a cluster (worker) node or in local threads.

Note: When Spark runs locally on a single box (Windows or Linux), it runs in a single JVM; all components (driver and executors) run within that same JVM.

Let us understand the terminology used in this architecture.

Driver Program

The driver is the main executable program from which Spark operations are performed. It is the process where the main() method of your program runs: the process running the user code that creates a SparkContext, creates RDDs, and performs transformations and actions. The driver program controls and coordinates all operations. Each run of a driver program constitutes a Spark application, and each action it performs launches a job. Once the driver terminates, the application is finished.

When the driver runs, it performs two duties:

Converting a user program into tasks: The Spark driver is responsible for converting a user program into units of physical execution called tasks. At a high level, all Spark programs follow the same structure: they create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical directed acyclic graph (DAG) of operations. When the driver runs, it converts this logical graph into a physical execution plan.

Spark performs several optimizations, such as “pipelining” map transformations together to merge them, and converts the execution graph into a set of stages. Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared to be sent to the cluster. Tasks are the smallest unit of work in Spark; a typical user program can launch hundreds or thousands of individual tasks.

Scheduling tasks on executors: Given a physical execution plan, a Spark driver must coordinate the scheduling of individual tasks on executors. When executors are started they register themselves with the driver, so it has a complete view of the application’s executors at all times. Each executor represents a process capable of running tasks and storing RDD data.

The Spark driver will look at the current set of executors and try to schedule each task in an appropriate location, based on data placement. When tasks execute, they may have a side effect of storing cached data. The driver also tracks the location of cached data and uses it to schedule future tasks that access that data. The driver exposes information about the running Spark application through a web interface, which by default is available at port 4040. For instance, in local mode, this UI is available at http://localhost:4040.
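As a sketch of how a logical graph of transformations becomes stages and tasks, the snippet below (assuming an existing SparkContext named sc and a placeholder input path) chains narrow transformations, which Spark pipelines into a single stage, with a reduceByKey, whose shuffle introduces a stage boundary; toDebugString prints the lineage that the driver turns into a physical plan.

```scala
// Assumes an existing SparkContext named sc; "input.txt" is a placeholder path.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split(" "))   // narrow transformation
  .map(word => (word, 1))             // narrow transformation, pipelined with the above
  .reduceByKey(_ + _)                 // wide transformation: shuffle => new stage

// Inspect the lineage (the logical DAG) behind this RDD.
println(counts.toDebugString)

// The action triggers a job; the driver schedules its tasks on the executors.
counts.take(5).foreach(println)
```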

Spark Context

Driver programs access Spark through a SparkContext object, which represents a connection to the computing cluster. The SparkContext is used to build RDDs and to manage the executors running on worker nodes. It splits jobs into parallel tasks and executes them on the worker nodes, partitions RDDs and distributes them across the cluster, and finally collects results and presents them to the driver program.
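A minimal sketch of creating a SparkContext is shown below; the application name and the local[*] master URL are placeholders, and on a real cluster the master is typically supplied by spark-submit rather than hard-coded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application name and master URL are illustrative; when submitting with
// spark-submit, the master is usually passed on the command line instead.
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")   // e.g. "yarn" or "spark://host:7077" on a cluster

val sc = new SparkContext(conf)

// ... build RDDs and run jobs through sc ...

sc.stop()   // shut down the executors and disconnect from the cluster manager
```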

Executors

The driver communicates with a potentially large number of distributed workers called executors. The driver runs in its own Java process and each executor is a separate Java process.

Spark executors are worker processes responsible for running the individual tasks in a given Spark job. Executors are launched once at the beginning of a Spark application and typically run for the entire lifetime of an application, though Spark applications can continue if executors fail. Executors have two roles. First, they run the tasks that make up the application and return results to the driver. Second, they provide in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor. Because RDDs are cached directly inside of executors, tasks can run alongside the cached data.

Note: In local mode, the Spark driver runs along with an executor in the same Java process. This is a special case; executors typically each run in a dedicated process.
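As a small illustration of executor-side storage (assuming an existing SparkContext named sc and a placeholder log path), persisting an RDD keeps its computed partitions as blocks in the Block Managers of the executors that produced them, so later tasks can be scheduled next to that cached data.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext named sc; "app.log" is a placeholder path.
val logs = sc.textFile("app.log")

// Keep partitions in executor memory, spilling to executor-local disk if needed.
val warnings = logs.filter(line => line.contains("WARN"))
  .persist(StorageLevel.MEMORY_AND_DISK)

// The first action computes the partitions and caches them on the executors...
println(warnings.count())

// ...subsequent actions reuse the cached blocks instead of re-reading the file.
println(warnings.first())
```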

Cluster Manager

So far we’ve discussed drivers and executors in somewhat abstract terms. But how do drivers and executor processes initially get launched? Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver. The cluster manager is a pluggable component in Spark. This allows Spark to run on top of different external managers, such as YARN and Mesos, as well as its built-in Standalone cluster manager.

Note: The terms driver and executor are used to describe the processes that execute each Spark application, while the terms master and worker are used to describe the centralized and distributed portions of the cluster manager. It’s easy to get these terms confused. For instance, Hadoop YARN runs a master daemon (called the Resource Manager) and several worker daemons (called Node Managers). Spark can run both drivers and executors on the YARN worker nodes.

Launching a Program

Spark provides a single script called spark-submit to submit applications. Using various options, spark-submit can connect to different cluster managers and control how many resources the application gets. For some cluster managers, spark-submit can run the driver within the cluster (e.g., on a YARN worker node), while for others it can run it only on your local machine.
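For example, a hypothetical submission of a packaged application to YARN might look like the following; the class name, jar name, and resource sizes are placeholders, and the exact options depend on your cluster manager.

```
spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  my-spark-app.jar
```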

Summary

To summarize the concepts in this section, let’s walk through the exact steps that occur when you run a Spark application on a cluster:

  • The user submits an application using spark-submit.
  • spark-submit launches the driver program and invokes the main() method specified by the user.
  • The driver program creates a SparkContext object. This tells Spark how and where to access a cluster.
  • The driver program contacts the cluster manager using SparkContext and asks for resources to launch executors.
  • The cluster manager launches executors on behalf of the driver program on the worker nodes.
  • JAR or Python files passed to the SparkContext are then sent to the executors.
  • The driver process runs through the user application. Based on the RDD actions and transformations in the program, the driver sends work to executors in the form of tasks using SparkContext.
  • Tasks are run on executor processes to ingest, compute and save results.
  • If the driver’s main() method exits or it calls SparkContext.stop(), it will terminate the executors and release resources from the cluster manager.
